Generalized substring selectivity estimation

Authors

Zhiyuan Chen

Flip Korn

Nick Koudas

S. Muthukrishnan

Abstract

In a variety of settings from relational databases to LDAP to Web applications, there is an increasing need to quickly and accurately estimate the count of tuples (LDAP entries, Web documents, etc.) matching Boolean substring queries. In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial. Selectivity estimation for generalized Boolean queries has not been studied previously; our own prior work, which is discussed and extended herein, applies to the case of one-dimensional Boolean queries [CKKM00]. Existing methods for the case of multidimensional conjunctive queries approximate selectivities by explicitly storing cross-counts of frequently co-occurring combinations of substrings; estimates are obtained by parsing the query into multidimensional substrings corresponding to stored cross-counts and applying probabilistic formulae. The major problem with these methods is that the number of cross-counts they store increases exponentially with the number of dimensions (a "space dimensionality explosion") due to the need to capture the correlation amongst the dimensions. Hence, given a limited amount of space, none of the existing methods can reliably give accurate estimates. Moreover, these methods do not generalize gracefully to Boolean queries. We present a novel approach to selectivity estimation for generalized Boolean substring queries with a focus on the two cases of (1) conjunctive multidimensional and (2) Boolean queries. Our approach does not explicitly store cross-counts, but rather generates them on the fly. We employ a Monte Carlo technique called set hashing to succinctly represent the set of tuples containing a given substring as a signature vector of hash values; any combination of set hash signatures gives a cross-count when intersected. Thus, using only linear storage, a large number of cross-counts can be generated, including those for complex co-occurrences of substrings.
The cross-counts generated by our methods are not exact, but they are adequate for selectivity estimation. We present results from an extensive experimental evaluation of our approach on real data sets. For the case of multidimensional conjunctive queries, our approach achieves better accuracy by an order of magnitude, and scales much more gracefully to higher dimensions, than existing methods. Surprisingly, even though our approach involves generating cross-counts on the fly, estimation is very fast, taking 200 ms on a data set of size 6 MB. For the case of Boolean queries, our experiments also demonstrate the superiority of this approach over a straightforward independence-based approach wherein correlations are not captured. © 2003 Published by Elsevier Science (USA).
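The set hashing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes min-wise hashing built on salted BLAKE2b digests as the hash family, and every function name below is hypothetical. Each substring's tuple-id set is summarized by a fixed-length signature of minimum hash values. The fraction of matching signature positions estimates the resemblance ρ = |A ∩ B| / |A ∪ B|, and the identity |A ∩ B| = ρ(|A| + |B|) / (1 + ρ) turns that into a cross-count. An independence-based estimate |A||B| / N is computed alongside for comparison.

```python
import hashlib

def hash_value(i, tuple_id):
    """i-th hash function (assumed family): salted 64-bit BLAKE2b digest."""
    data = f"{i}:{tuple_id}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(tuple_ids, k):
    """Signature vector: the minimum hash value under each of k hash functions."""
    return [min(hash_value(i, t) for t in tuple_ids) for i in range(k)]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of positions where the two signatures agree (estimates Jaccard)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def estimate_cross_count(size_a, size_b, sig_a, sig_b):
    """Estimate |A ∩ B| from the set sizes and the signatures alone."""
    rho = estimate_resemblance(sig_a, sig_b)
    return rho / (1 + rho) * (size_a + size_b)

def independence_estimate(size_a, size_b, n):
    """Baseline that ignores correlation: N * (|A|/N) * (|B|/N)."""
    return size_a * size_b / n

# Toy data (made up for illustration): 1000 tuples, two correlated substrings.
N = 1000
A = set(range(0, 200))    # ids of tuples containing the first substring
B = set(range(100, 300))  # ids of tuples containing the second substring
K = 256                   # signature length

sig_a, sig_b = signature(A, K), signature(B, K)
est = estimate_cross_count(len(A), len(B), sig_a, sig_b)
indep = independence_estimate(len(A), len(B), N)

print("true cross-count:     ", len(A & B))
print("set-hashing estimate: ", round(est, 1))
print("independence estimate:", indep)
```

Storage per substring is K hash values regardless of how many pairs are later queried, which is the sense in which cross-counts are generated on the fly from linear space. Longer signatures reduce the variance of the estimate at the cost of more space.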


Similar resources

Processing Queries on Road Networks in Spatial Data Base Perspective for Selectivity Estimation

This work focuses on building a framework capable of analyzing spatial approximate substring queries, mainly to solve the selectivity estimation problem for range queries on road networks represented in spatial databases. Selectivity estimation means estimating the size of the results, i.e., estimating the number of points present in a graph whi...


Substring Count Estimation in Extremely Long Strings

To estimate the number of substring matches against string data, count suffix trees (CS-tree) have been used as a kind of alphanumeric histograms. Although the trees are useful for substring count estimation in short data strings (e.g. name or title), they reveal several drawbacks when the target is changed to extremely long strings. First, it becomes too hard or at least slow to build CS-trees...


CXHist: An On-line Classification-Based Histogram for XML String Selectivity Estimation

Query optimization in IBM’s System RX, the first truly relational-XML hybrid data management system, requires accurate selectivity estimation of path-value pairs, i.e., the number of nodes in the XML tree reachable by a given path with the given text value. Previous techniques have been inadequate, because they have focused mainly on the tag-labeled paths (tree structure) of the XML data. For m...


Multi-Dimensional Substring Selectivity Estimation

With the explosion of the Internet, LDAP directories and XML, there is an ever greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use multi-dimensional count-suffix trees as t...


Generalized Substring Compression

In substring compression one is given a text to preprocess so that, upon request, a compressed substring is returned. Generalized substring compression is the same with the following twist. The queries contain an additional context substring (or a collection of context substrings) and the answers are the substring in compressed format, where the context substring is used to make the compression...



Journal:

J. Comput. Syst. Sci.

Volume 66

Publication date: 2003
